Abstract

Authors of biomedical publications often use 
gel images to report experimental results such 
as protein-protein interactions or protein ex- 
pressions under different conditions. Gel im- 
ages offer a way to concisely communicate 
such findings, not all of which need to be ex- 
plicitly discussed in the article text. This fact 
together with the abundance of gel images 
and their shared common patterns makes them 
prime candidates for image mining endeav- 
ors. We introduce an approach for the detec- 
tion of gel images, and present an automatic 
workflow to analyze them. We are able to de- 
tect gel segments and panels at high accuracy, 
and present first results for the identification of 
gene names in these images. While we cannot 
provide a complete solution at this point, we 
present evidence that this kind of image min- 
ing is feasible. 

1 Introduction 

A recent trend in the area of literature mining
is the inclusion of images in the form of figures
from biomedical publications (Yu and Lee, 2006;
Zweigenbaum et al., 2007; Peng, 2008). This
development benefits from the fact that an increasing
number of scientific articles are published as
open access publications. This means that not just
the abstracts but the complete texts including images
are available for data analysis. Among other things,
this enabled the development of query engines for
biomedical images like the Yale Image Finder (Xu
et al., 2008) and the BioText Search Engine (Hearst
et al., 2007).



Gel images are a very frequent type of image 
in the biomedical literature. They are the result of 
gel electrophoresis, which is a common method to 
analyze DNA, RNA and proteins. Southern, Western
and Northern blotting (Southern, 1975; Alwine
et al., 1977; Burnette, 1981) are among the most
common applications of gel electrophoresis. The
resulting experimental artifacts are often shown in
biomedical publications in the form of gel images as 
evidence for the discussed findings such as protein- 
protein interactions or protein expressions under 
different conditions. In our experience,
about 15% of all subfigures (i.e. independent parts of 
a figure) are gel images. Often, not all details of the 
results shown in these images are explicitly stated in 
the caption or the article text. For these reasons, it 
would be of high value to be able to reliably mine 
the relations encoded in these images. 

A closer look at gel images reveals that they
follow regular patterns to encode their semantic
relations. Figure 1 shows two typical examples of gel
images together with a table representation of the 
involved relations. The ultimate objective of our ap- 
proach (for which we can only present a partial solu- 
tion here) is to automatically extract at least some of 
these relations from the respective images, possibly 
in conjunction with classical text mining techniques. 
The first example shows a Western blot for detecting
two proteins (14-3-3σ and β-actin as a control)
in four different cell lines (MDA-MB-231, NHEM,
C8161.9, and LOX, the first of which is used as a
control). There are two rectangular gel segments
arranged to form a 2 x 4 grid for the eight individual
measurements combining each protein with each cell line.

Figure 1: Two examples of gel images from biomedical publications (PMID 19473536 and 15125785) with tables
showing the relations that could be extracted from them

A gel diagram can be considered
a kind of matrix with pictures of experimental
artifacts as content. The tables to the right illustrate 
the semantic relations encoded in the gel diagrams. 
Each relation instance consists of a condition, a mea- 
surement and a result. The proteins are the entities 
being measured under the conditions of the different 
cell lines. The result is a certain degree of expression 
indicated by the darkness of the spots (or brightness 
in the case of white-on-black gels). The second ex- 
ample is a slightly more complex one. Several pro- 
teins are tested against each other in a way that in- 
volves more than two dimensions. In this case, the 
use of "+" and "-" labels is a frequent technique to 
denote the different possible combinations of a num- 
ber of conditions. Apart from that, the principles are 
the same. In this case, however, the number of
relations is much larger. Only the first eight of the 32
relation instances are shown in the table to the right.
In such cases, the text rarely mentions all these re- 
lations in an explicit way, and the image is therefore 
the only accessible source. 
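
As a minimal sketch of this data model, assuming a simple (condition, measurement, result) record, the 2 x 4 grid of the first example can be enumerated as follows. The class and field names are illustrative only, and the result is left empty because it would have to be read from the gel spots.

```python
from dataclasses import dataclass
from itertools import product
from typing import Optional


@dataclass(frozen=True)
class Relation:
    condition: str           # e.g. the cell line
    measurement: str         # e.g. the protein being measured
    result: Optional[str]    # degree of expression; unknown until the spot is analyzed


cell_lines = ["MDA-MB-231", "NHEM", "C8161.9", "LOX"]  # conditions (columns)
proteins = ["14-3-3σ", "β-actin"]                      # measurements (rows)

# One relation instance per cell of the 2 x 4 grid.
relations = [Relation(c, p, None) for p, c in product(proteins, cell_lines)]

print(len(relations))  # 8
```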



2 Background 

In principle, image mining involves the same
processes as classical literature mining (De Bruijn and
Martin, 2002): document categorization, named
entity tagging, fact extraction, and collection-wide
analysis. However, there are some subtle differ- 
ences. Document categorization corresponds to im- 
age categorization, which is different in the sense 
that it has to deal with features based on the two- 
dimensional space of pixels, but otherwise the same 
principles of automatic categorization apply. Named 
entity tagging is different in two ways: pinpointing 
the mention of an entity is more difficult with images 
(a large number of pixels versus a couple of charac- 
ters), and OCR errors have to be considered. Fact 
extraction in classical literature mining involves the 
analysis of the syntactic structure of the sentences. 
In images, in contrast, there are rarely complete sen- 
tences, but the semantics is rather encoded by graph- 
ical means. Thus, instead of parsing sentences, one 
has to analyze graphical elements and their relation 
to each other. The last process, collection-wide
analysis, is a higher-level problem, and therefore no
fundamental differences can be expected. Thus, image
mining builds upon the same general stages as
classical text mining, but with some subtle but important
differences.

Image mining on biomedical publications is not
a new idea. It has been applied for the extraction
of subcellular location information (Murphy et
al., 2004), the detection of panels of fluorescence
microscopy images (Qian and Murphy, 2008), the
extraction of pathway information from diagrams
(Kozhenkov and Baitaluk, 2012), and the detection
of axis diagrams (Kuhn et al., 2012). Also, there is
a large amount of existing work on how to process
gel images (Lemkin, 1997; Luhn et al., 2003; Cutler
et al., 2003; Rogers et al., 2003; Zerr and Henikoff,
2005) and databases have been proposed to store the
results of gel analyses (Schlamp et al., 2008). These
techniques, however, take as input plain gel images,
which are not readily accessible from biomedical
papers, because they make up just parts of the
figures. Furthermore, these tools are designed for
researchers who want to analyze their gel images and
not to read gel diagrams that have already been
analyzed and annotated by a researcher. Therefore,
these approaches do not tackle the problem of
recognizing and analyzing the labels of gel images.
Some attempts to classify biomedical images
include gel figures (Rodriguez-Esteban and Iossifov,
2009), which is, however, just the first step in
locating them and analyzing their labels and their
structure. To our knowledge, nobody has yet tried to
perform image mining on gel diagrams.

3 Approach and Methods 

Figure 2 shows the procedure of our approach to
image mining from gel diagrams. It consists of seven
steps: figure extraction, segmentation, text
recognition, gel detection, gel panel detection, named entity
recognition and relation extraction.

Using structured article representations, the first 
step is trivial. For steps two and three, we rely on
existing work. The focus of this paper lies on steps 
four, five and six: the detection of gels and gel pan- 
els and the recognition of named entities. We sketch 
how step seven could be implemented, but we can- 
not provide a solution at this point. 

To practically evaluate our approach, we ran our
pipeline on the entire open access subset of PubMed
Central (though not all figures made it through the
whole pipeline due to technical difficulties).

3.1 Figure Extraction 

A large portion of the articles of the open access sub- 
set of the PubMed Central database are available as 
structured XML files with additional image files for 
the figures. We only use these articles so far, which 
makes the figure extraction task very easy. It would 
be more difficult, though definitely feasible, to ex- 
tract the figures from PDF files or even bitmaps of 
scanned articles. 

3.2 Segmentation and Text Recognition 

For the next two steps, segment detection and
subsequent text recognition, we rely on our previous
work (Xu and Krauthammer, 2010; Xu and
Krauthammer, 2011). This method includes the
detection of layout elements, edge detection, and text
recognition with a novel pivoting approach. For optical
character recognition (OCR), the Microsoft Document
Imaging package is used, which is available
as part of Microsoft Office 2003. Overall, this
approach has been shown to perform better than other
existing approaches for the images found in biomedical
publications (Xu and Krauthammer, 2010). We
do not go into the details here, as this paper focuses
on the subsequent steps.

Due to some limitations of the segmentation
algorithm when it comes to rectangles with low internal
contrast (like gels), we additionally applied a very
simple complementary rectangle detection algorithm.

3.3 Gel Segment Detection 

Based on the results of the above-mentioned steps, 
we try to identify gel segments. Such gel segments 
typically have rectangular shapes with darker spots 
on a light gray background, or — less commonly 
— white spots on a dark background. We decided 
to use machine learning techniques to generate clas- 
sifiers to detect such gel segments. To do so, we 
defined 39 numerical features for image segments: 
the coordinates of the relative position (within the 
image), the relative and absolute width and height, 
16 grayscale histogram features, three color features 
(for red, green and blue), 13 texture features based 
on Haralick et al. (1973), and the number of
recognized characters.

Figure 2: The procedure of our approach: (1) figure extraction, (2) segmentation, (3) text recognition, (4) gel detection,
(5) gel panel detection, (6) named entity recognition, and (7) relation extraction.
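
The feature count can be checked by adding up the groups listed above: 2 + 4 + 16 + 3 + 13 + 1 = 39. The following sketch illustrates two of the groups under stated assumptions: it assumes that the 16 histogram features are normalized counts over 16 equal-width gray-level bins and that the three color features are channel means. The text does not specify either, and the function names are hypothetical.

```python
from typing import List, Tuple

Pixel = Tuple[int, int, int]  # (red, green, blue), each 0..255

# Feature groups as listed in the text; the sizes add up to 39.
FEATURE_GROUPS = {
    "relative position (x, y)": 2,
    "relative and absolute width and height": 4,
    "grayscale histogram": 16,
    "color (red, green, blue)": 3,
    "Haralick texture": 13,
    "number of recognized characters": 1,
}


def gray_histogram(pixels: List[Pixel], bins: int = 16) -> List[float]:
    """Normalized histogram of gray levels (assumed: plain channel average)."""
    counts = [0] * bins
    for r, g, b in pixels:
        gray = (r + g + b) // 3            # 0..255
        counts[gray * bins // 256] += 1    # equal-width bins
    total = len(pixels) or 1
    return [c / total for c in counts]


def color_means(pixels: List[Pixel]) -> List[float]:
    """Mean red, green and blue value (assumed definition of the color features)."""
    total = len(pixels) or 1
    return [sum(p[i] for p in pixels) / total for i in range(3)]


# Toy segment: light gray background with some dark spots.
segment = [(200, 200, 200)] * 90 + [(30, 30, 30)] * 10
features = gray_histogram(segment) + color_means(segment)
print(sum(FEATURE_GROUPS.values()), len(features))  # 39 19
```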



To train the classifiers, we took a random sam- 
ple of 500 figures, for which we manually annotated 
the gel segments. In the same way, we obtained a 
second sample of another 500 figures for testing the 
classifiers. We used the Weka toolkit and opted for 
random forest classifiers based on 75 random trees (footnote 1).
Using different thresholds to adjust the trade-off be- 
tween precision and recall, we generated a classifier 
with good precision and another one with good re- 
call. Both of them are used in the next step. 
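
The effect of the threshold can be illustrated with a small sketch. It assumes that the classifier outputs, for each segment, a probability of being a gel segment, and that a segment is labeled as gel when this probability reaches the threshold. The scores and labels below are made-up toy values, not data from the experiments; the thresholds 0.15 and 0.6 are the ones reported in Section 4.

```python
from typing import List, Tuple


def precision_recall(scores: List[float], is_gel: List[bool],
                     threshold: float) -> Tuple[float, float]:
    """Precision and recall when segments with score >= threshold count as gel."""
    predicted = [s >= threshold for s in scores]
    tp = sum(p and g for p, g in zip(predicted, is_gel))
    fp = sum(p and not g for p, g in zip(predicted, is_gel))
    fn = sum((not p) and g for p, g in zip(predicted, is_gel))
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 1.0
    return precision, recall


# Toy data (hypothetical): classifier scores and manual annotations.
scores = [0.95, 0.70, 0.40, 0.20, 0.18, 0.10, 0.05]
is_gel = [True, True, True, False, True, False, False]

high_recall = precision_recall(scores, is_gel, threshold=0.15)
high_precision = precision_recall(scores, is_gel, threshold=0.6)
print(high_recall, high_precision)
```

On this toy data, the lower threshold finds all gel segments at the cost of one false positive, while the higher threshold returns only correct segments but misses half of them.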

3.4 Gel Panel Detection 

A gel panel typically consists of several gel seg- 
ments and comes with labels describing the involved 
genes, proteins, and conditions. For our goal, it is 
not sufficient to just detect the figures that contain 
gel panels, but we also have to extract their posi- 
tions within the figures and to access their labels. 
This is not a simple classification task, and therefore 
machine learning techniques do not apply that eas- 
ily. For that reason, we used a detection procedure 
based on hand-coded rules. 

In a first step, we group gel segments to find con- 
tiguous gel regions that form the center part of gel 
panels. To do so, we start with looking for segments 
that our high-precision classifier detects as gel seg- 
ments. Then, we repeatedly look for adjacent gel 
segments, this time applying the high-recall classi- 
fier, and merge them. Two segments are considered 
neighbors if they are at most 50 pixels apart (footnote 2) and do
not have any text segment between them. Thus,
segments which could be gel segments according to the
high-recall classifier make it into a gel panel only if
there is at least one high-precision segment in their
group. The goal is to detect panels with high
precision, but also to detect the complete panels and not
just parts of them. In the given situation, precision is
more important than recall, because low recall can
be compensated for by the large number of available gel
images.

1 We also tried other types of classifiers including support
vector machines, but we achieved the best results with random
forests.

2 We are using absolute distance values at this point. A more
refined algorithm could apply some sort of relative measure.
However, the resolution of the images does not vary that much,
which is why absolute values worked out well so far.
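
The grouping rule described above can be sketched as a simple region-growing procedure. This is an illustration under assumptions, not the actual implementation: the distance between two segments is taken as the Euclidean gap between their rectangles, and a text segment counts as lying "between" two segments if it overlaps their common bounding box. All names are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Set

MAX_GAP = 50  # pixels, as stated in the text


@dataclass(frozen=True)
class Rect:
    x0: int
    y0: int
    x1: int
    y1: int


def gap(a: Rect, b: Rect) -> float:
    """Distance between two rectangles (assumed: Euclidean gap, 0 if they touch)."""
    dx = max(0, max(a.x0, b.x0) - min(a.x1, b.x1))
    dy = max(0, max(a.y0, b.y0) - min(a.y1, b.y1))
    return (dx * dx + dy * dy) ** 0.5


def intersects(a: Rect, b: Rect) -> bool:
    return a.x0 < b.x1 and b.x0 < a.x1 and a.y0 < b.y1 and b.y0 < a.y1


def text_between(a: Rect, b: Rect, texts: List[Rect]) -> bool:
    """Assumed test: a text segment overlaps the bounding box of both segments."""
    hull = Rect(min(a.x0, b.x0), min(a.y0, b.y0), max(a.x1, b.x1), max(a.y1, b.y1))
    return any(intersects(hull, t) for t in texts)


def group_gel_regions(seeds: List[Rect], candidates: List[Rect],
                      texts: List[Rect]) -> List[Set[Rect]]:
    """Grow regions from high-precision seeds through high-recall candidates."""
    pool = set(seeds) | set(candidates)
    regions: List[Set[Rect]] = []
    for seed in seeds:
        if any(seed in region for region in regions):
            continue  # already merged into an earlier region
        region, frontier = {seed}, [seed]
        while frontier:
            current = frontier.pop()
            for other in pool - region:
                if gap(current, other) <= MAX_GAP and not text_between(current, other, texts):
                    region.add(other)
                    frontier.append(other)
        regions.append(region)
    return regions


seed = Rect(0, 0, 100, 20)            # detected by the high-precision classifier
near = Rect(0, 30, 100, 50)           # high-recall candidate, 10 px below the seed
far = Rect(0, 200, 100, 220)          # candidate that is too far away
blocked = Rect(130, 0, 230, 20)       # close, but separated by a text segment
text = Rect(105, 0, 125, 20)
regions = group_gel_regions([seed], [near, far, blocked], [text])
print(regions == [{seed, near}])
```

Because every region starts from a high-precision seed, a candidate that only the high-recall classifier accepts can never form a region on its own.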

As a next step, we collect the labels in the form 
of text segments located around the detected gel re- 
gions. For a text segment to be attributed to a cer- 
tain gel panel, its nearest edge must be at most 30 
pixels away from the border of the gel region and 
its farthest edge must not be more than 150 pixels 
away. We end up with a representation of a gel panel 
consisting of two parts: a center region containing 
a number of gel segments and a set of labels in the 
form of text segments located around the center re- 
gion. 
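
The label attribution rule can be sketched in the same spirit. The text fixes the two limits (30 and 150 pixels) but not how the distances are measured, so the sketch assumes axis-aligned rectangles and takes, for each criterion, the larger of the horizontal and vertical distance. The names are hypothetical.

```python
from dataclasses import dataclass
from typing import List

MAX_NEAR = 30   # pixels: nearest edge of the label to the gel region
MAX_FAR = 150   # pixels: farthest edge of the label to the gel region


@dataclass(frozen=True)
class Rect:
    x0: int
    y0: int
    x1: int
    y1: int


def nearest_edge_distance(region: Rect, text: Rect) -> int:
    """Gap between the region border and the closest edge of the text segment."""
    dx = max(0, text.x0 - region.x1, region.x0 - text.x1)
    dy = max(0, text.y0 - region.y1, region.y0 - text.y1)
    return max(dx, dy)


def farthest_edge_distance(region: Rect, text: Rect) -> int:
    """How far the text segment extends beyond the region border."""
    dx = max(0, region.x0 - text.x0, text.x1 - region.x1)
    dy = max(0, region.y0 - text.y0, text.y1 - region.y1)
    return max(dx, dy)


def collect_labels(region: Rect, texts: List[Rect]) -> List[Rect]:
    return [t for t in texts
            if nearest_edge_distance(region, t) <= MAX_NEAR
            and farthest_edge_distance(region, t) <= MAX_FAR]


region = Rect(200, 100, 600, 300)
left_label = Rect(90, 120, 190, 140)       # gap of 10 px, extends 110 px to the left
long_label = Rect(20, 150, 190, 170)       # gap of 10 px, but extends 180 px
distant_label = Rect(200, 350, 300, 370)   # 50 px below the region
labels = collect_labels(region, [left_label, long_label, distant_label])
print(labels == [left_label])
```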

To evaluate this algorithm, we collected yet an- 
other sample of 500 figures. For these, we manually 
checked whether the algorithm is able to detect the 
presence and the (approximate) position of the gel 
panels. 

3.5 Named Entity Recognition 

The next step is to recognize the named entities
mentioned in the gel labels. To this aim, we
investigated whether we are able to extract the names of
genes and proteins from gel diagrams (footnote 3). To do so,
we tokenized the label texts and looked for entries
in the Entrez Gene database to match the tokens.
This look-up is done in a case-sensitive way, because
many names in gel labels are acronyms, where the
specific capitalization pattern can be critical to
identify the respective entity. We excluded tokens that
have fewer than three characters, are numbers (Arabic
or Roman numerals), or correspond to common short words
(retrieved from a list of the 100 most frequent words in
biomedical articles). In addition, we extended this
exclusion list with 22 general words that are
frequently used in the context of gel diagrams, some of
which coincide with gene names according to
Entrez (footnote 4).

3 Apart from genes and proteins, we plan to include the
names of cell lines and drugs in future work.

Since gel electrophoresis is a method to analyze 
genes and proteins, we would expect to find more 
such mentions in gel labels than in other text seg- 
ments of a figure. By measuring this, we get an idea 
of whether the approach works out or not. In ad- 
dition, we manually checked the gene and protein 
names extracted from gel labels after running our 
pipeline on 2000 random figures. 
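
The look-up and filtering rules described in this section can be sketched as follows. The tokenization, the Roman numeral check and the case-insensitive matching of the exclusion list are assumptions, and the small gene name set in the usage example is a hypothetical stand-in for the Entrez Gene database. The 22 gel-specific words are those listed in footnote 4.

```python
import re
from typing import Iterable, List, Set

# The 22 gel-specific words listed in footnote 4.
GEL_WORDS = {"min", "hrs", "line", "type", "protein", "DNA", "RNA", "mRNA",
             "membrane", "gel", "fold", "fragment", "antigen", "enzyme", "kinase",
             "cleavage", "factor", "blot", "pro", "pre", "peptide", "cell"}

# Well-formed Roman numerals (a crude check, assumed for illustration).
ROMAN = re.compile(r"M{0,3}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})",
                   re.IGNORECASE)


def is_number(token: str) -> bool:
    return token.isdigit() or bool(ROMAN.fullmatch(token))


def gene_mentions(label: str, gene_names: Set[str],
                  common_words: Iterable[str] = ()) -> List[str]:
    """Return the tokens of a gel label that match a gene name."""
    excluded = {w.lower() for w in GEL_WORDS} | {w.lower() for w in common_words}
    hits = []
    for token in re.findall(r"[^\s,;:()\[\]]+", label):  # assumed tokenization
        if len(token) < 3 or is_number(token) or token.lower() in excluded:
            continue
        if token in gene_names:  # case-sensitive look-up
            hits.append(token)
    return hits


toy_genes = {"TP53", "ACTB", "Akt1"}  # hypothetical stand-in for Entrez Gene
found = gene_mentions("TP53, ACTB (III); akt1, 30 min blot", toy_genes)
print(found)  # ['TP53', 'ACTB']
```

In this example, "akt1" is not matched because the look-up is case-sensitive and the toy dictionary only contains "Akt1".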

3.6 Relation Extraction 

For the last step, relation extraction, we cannot 
present concrete results at this point. After recog- 
nizing the named entities, we would have to dis- 
ambiguate them, identify their semantic roles (con- 
dition, measurement or something else), align the 
gel images with the labels, and ultimately quantify 
the degree of expression. To improve the quality of 
the results, combinations with classical text mining 
techniques should be considered. This is all future 
work. We expect to be able to profit to a large
extent from existing work to disambiguate protein and
gene names (Rinaldi et al., 2008; Tanabe and Wilbur,
2002) and to detect and analyze gel spots (see the
existing work mentioned above).

4 Results 

Table 1 shows the result of the gel detection
classifier. We generated three different classifiers from
the training data, one for each of the threshold
values 0.15, 0.3 and 0.6. Lower threshold values lead
to higher recall at the cost of precision, and vice
versa. In the balanced case, we achieved an F-score
of 75%. For the classifiers with precision or recall
over 90%, the F-score goes down significantly, but stays
in a sensible range. These two classifiers (thresholds
0.15 and 0.6) are used in the next step. To interpret
these values, one has to consider that gel segments
are greatly outnumbered by non-gel segments.
Concretely, only about 3% are gel segments. Accuracy
measures take this into account. The accuracy of the
presented classifiers, measured as the area under the
ROC curve, is 98.0% (footnote 5).

4 These words are: min, hrs, line, type, protein, DNA, RNA,
mRNA, membrane, gel, fold, fragment, antigen, enzyme, kinase,
cleavage, factor, blot, pro, pre, peptide, and cell.

Threshold   Precision   Recall   F-score
0.15        ...         0.909    0.592
0.30        ...         ...      ...
0.60        0.926       0.301    ...
Table 1: The results of the gel segment detection classifiers

Precision   Recall   F-score
0.951       0.379    0.542
Table 2: The results of the gel panel detection algorithm

The results of the gel panel detection algorithm 
are shown in Table 2. The precision is 95% at a recall
of 38%, leading to an F-score of 54%. 
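
As a quick check of these numbers, and assuming that the F-score is the balanced harmonic mean of precision and recall, the reported value follows directly from the precision and recall in Table 2:

```python
def f_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (balanced F-score)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


print(round(f_score(0.951, 0.379), 3))  # 0.542
```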

Table 3 shows the results of running the pipeline
on PubMed Central. We started with about 410 000
articles, the entire open access subset of PubMed
Central at the time we downloaded them (February
2012). We successfully parsed the XML files of 94%
of these articles (for the remaining articles, the XML
file was missing or not well-formed, or other
unexpected errors occurred). The successfully parsed articles
contained around 1 100 000 figures, for some of which
our segment detection step encountered image
formatting errors or other internal errors, or was just not
able to detect any segments. We ended up with more
than 880 000 figures, in which we detected about
86 000 gel panels, i.e. roughly ten out of 100 figures.
For each of them, we found on average 3.6 labels
with recognized text. After tokenization, we
identified about 76 000 gene names in these gel labels,
which corresponds to 6.8% of the tokens.
Considering all text segments (including but not restricted to
gel labels), only 3.3% of the tokens are detected as
gene names (footnote 6).

Table 3: The results of running the pipeline on the open
access subset of PubMed Central

Table 4 shows the results of the evaluation of
the detection algorithm for gene and protein names.
Almost two-thirds of the detected gene/protein
tokens (65.3%) were correctly identified. 9% thereof
were correct but could be more specific, e.g. when
only "actin" was recognized for "β-actin". The
incorrect cases (34.6%) can be split into two classes of
roughly the same size: some recognized tokens were
actually not mentioned in the figure but emerged
from OCR errors; other tokens were correctly
recognized but incorrectly classified as gene or protein
references.

5 This measure includes all thresholds from 0 to 1.

6 The low numbers are partially due to the fact that a
considerable part of the tokens are "junk tokens" produced by the
OCR step when trying to recognize characters in segments that
do not contain text.

5 Discussion 

The presented results show that we are able to
detect gel segments with high accuracy, which allows
us to subsequently detect whole gel panels at a high
precision. The recall of the panel detection step is
relatively low, but at about 38% it is still in a
reasonable range. As mentioned above, we can leverage the
high number of available figures, which makes
precision more important than recall.

Running our pipeline on the whole set of open ac- 
cess articles from PubMed Central, we were able to 
retrieve 85 942 potential gel panels (around 95% of 
which we can expect to be correctly detected). The 
detection of gene and protein names reveals that they 
are more than twice as frequent in gel labels as in
other text segments, which is consistent with what 
one would expect. This simple gene detection step 
performs reasonably well with a precision of about 
65%, though there is certainly room for improve- 
ment. 
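
The two figures used in this paragraph can be reproduced with a short calculation. It assumes that the precision measured on the evaluation sample (Table 2) carries over to the whole collection, which is an extrapolation rather than a measured result.

```python
panels_detected = 85_942
panel_precision = 0.951  # precision of the panel detection step (Table 2)
expected_correct = round(panels_detected * panel_precision)

gene_share_gel_labels = 0.068  # share of tokens recognized as gene names in gel labels
gene_share_all_text = 0.033    # share over all text segments
ratio = gene_share_gel_labels / gene_share_all_text

print(expected_correct, round(ratio, 2))  # 81731 2.06
```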

It seems reasonable to assume that these results 
can be combined with existing techniques of term 
disambiguation and gel spot detection at a satisfac- 
tory level of accuracy. We plan to investigate this in 
future work. 

Our results indicate that it is feasible to extract 
relations from gel images, but it is clear that this 
procedure is far from perfect. The automatic anal- 
ysis of bitmap images seems to be the only efficient 
way to extract such relations from existing publi- 
cations, but other publishing techniques should be 
considered for the future. The use of vector graphics 
instead of bitmaps would already greatly improve 
any subsequent attempts of automatic analysis. A 
further improvement would be to establish accepted 
standards for different types of biomedical diagrams 
in the spirit of the Unified Modeling Language, a 
graphical language widely applied in software engi- 
neering since the 1990s. Ideally, the resulting images 
could directly include semantic relations in a formal 
notation, which would make relation mining a trivial 
procedure. If authors are supported by good tools to 
draw diagrams like gel images, this approach could 
turn out to be feasible even in the near future. 

6 Conclusions 

Successful image mining from gel diagrams in 
biomedical publications would unlock a large 
amount of valuable data. Our results show that gel 
panels and their labels can be detected with high ac- 
curacy, applying machine learning techniques and 
hand-coded rules. We also showed that genes and 
proteins can be detected in the gel labels with satis- 
factory precision. 



Based on these results, we believe that this kind 
of image mining is a promising and viable approach 
to provide more powerful query interfaces for re- 
searchers, to gather relations such as protein-protein 
interactions, and to generally complement existing 
text mining approaches. At the same time, we be- 
lieve that an effort towards standardization of sci- 
entific diagrams such as gel images would greatly 
improve the efficiency and precision of image min- 
ing at relatively low additional costs at the time of 
publication. 